Purpose

The purpose of this paper is to delineate several problems which
arise when criterion referenced test results are used to evaluate the
effects of a specific educational treatment. Specifically, the paper
deals with: (1) alternative methods of aggregating individual student
and group data on objectives, (2) the sensitivity of the instrument
to program outcomes, and (3) the comparison of criterion referenced
test data and standardized achievement test data.

Background 

During the past decade, there has been extensive discussion of
the merits of criterion referenced testing as an alternative to norm
referenced tests (Popham & Husek, 1969; Hambleton & Novick, 1972; and
Gronlund, 1973). While criterion referenced tests have been defined
in a multitude of ways, an underlying thread among all of these
definitions is the assumption that criterion referenced tests are
deliberately constructed so as to yield measurements that are directly
interpretable in terms of specified performance standards (Glaser &
Novick, 1971). In spite of this assumption of direct interpretability,
very little clear direction is given in the literature on specific
ways in which criterion referenced test results have been used practicably
to evaluate either student progress or program outcomes. Popham &
Husek (1969) recommend that a number of schemes to report the group's
performance be employed in order to permit more enlightened
interpretations; for example, the number of individuals who achieve the
criterion, traditional descriptive statistics such as the mean and
standard deviation, and an average "percentage correct." Knipe &
Krahmer (1973) present "student by objective grids as unsophisticated
ways of detecting different learning patterns." Gronlund (1973)
recommends that criterion referenced test results be interpreted
cautiously.

Empirical examples of criterion referenced test results reported
in the literature (Hsu, 1971; Roudabush, 1973; and Roudabush & Green, 1971)
have focused on the improvement of criterion referenced test items,
rather than the use of the data for instructional decisions. The
extensive discussion of criterion referenced test reliability and
errors of measurement (Millman et al., 1975) suggests that criterion
referenced test results may be far from directly interpretable.

Statement of Problems

The literature has contained many articles about the controversy
between criterion referenced tests (CRT) and norm referenced tests.
Most of these articles were based on the conceptual and theoretical
differences between these tests. Few of the articles made objective
comparisons based on empirical data.

Both norm referenced and criterion referenced tests are designed
to make decisions about individuals or programs. The decision may be
one of selection or one of improvement. In the case of norm referenced
tests, the decisions are made in reference to the performance of
normative groups of individuals or programs placed in the same decision
situation. In the case of criterion referenced tests, the decision is
critically related to a comparison of the individual's performance with
an arbitrarily established standard of performance or criterion level.
This latter point becomes important when the decisions are made on test
items which are not obviously norm-distinctive or criterion-distinctive.
The items on the two types of tests are, in fact, more often interchangeable
than not.

Unlike other papers on criterion referenced tests and norm referenced
tests, this paper is based on empirical data collected concurrently with
both a Criterion Referenced Test and a Norm Referenced Test. Some of the
questions that the study will address are:

1. If one reports and aggregates criterion referenced test data in
   different ways, would the results be consistent?

2. Is the criterion referenced test sensitive to the changes that
   occur in students?

3. Are the estimates of the program effects based on criterion
   referenced test results and standardized test results comparable?

Methods

Data were collected on a group of 182 fourth, fifth, and sixth grade
students located in two elementary schools within the Cincinnati Public
School System. These students were selected for this study because of
their involvement in a commercially prepared reading comprehension and
verbal skills curriculum.

The curriculum is an individualized, self-paced program for the
development of reading skills. Each student proceeds at his own pace
with a prescribed set of learning materials and activities provided in
the reading learning center.

Each student participating in the program was tested with a
commercially prepared Criterion Referenced Test in November, 1974, and
again in May, 1975. The Criterion Referenced Test was designed by a
reputable educational testing firm to assess the objectives of the
curriculum. Each student was also tested with an appropriate level
of the reading subtest of the Metropolitan Achievement Test in
October, 1974, and April, 1975.

The level of the Criterion Referenced Test that was given to the
students was determined by their score on a short screening test. There
were three levels (I, II, and III) of the Criterion Referenced Test. It
was possible for a fourth grade student to take the highest level (III)
of the Criterion Referenced Test, and it was just as possible for a
sixth grade student to take the lowest level test (I). Data on the
criterion referenced test and the standardized achievement test were
analyzed separately according to these groups.

Each test level included different objectives. Although there was
an overlap of objectives at each level, the test items at each level
measuring the objectives were different. Table 1 shows the objectives
included at each level and also the number of items measuring each
objective at each level.

Mastery of an objective was determined for students who had 75 percent
or more of the items correct for the objective. Student progress through
the curriculum was determined by the same criterion. Table 2 indicates
the rules used in determining the mastery for each objective.

Table 1. Objectives at Each Level of the Criterion Referenced Test and
the Number of Items Measuring Each Objective.

Level I                          Level II                         Level III
Objective (# of Items)           Objective (# of Items)           Objective (# of Items)

Letter Recognition (2)           Sentence Comprehension (2)       Contextual Cues (1)
Initial, Final Sounds (5)        Contextual Cues (3)              Main Theme (5)
Vowel Sounds (4)                 Main Theme (5)                   Specific Detail (5)
Consonant Sounds (6)             Specific Detail (5)              Sequence (3)
Word Endings (3)                 Sequence (4)                     Drawing Inferences (3)
Other (6)                        Drawing Inferences (5)           Author's Intent, Viewpoint (3)
Sentence Comprehension (2)       Author's Intent, Viewpoint (4)   Word Meanings (4)
Main Theme (3)                   Word Meanings (5)                Special Usage (3)
Specific Detail (3)              Special Usage (4)                Follow Directions (3)
Sequence (3)                     Follow Directions (3)            Interpret Charts, Graphs (3)
Drawing Inferences (2)           Understanding Structure (1)      Understanding Structure (3)
Author's Intent, Viewpoint (3)                                    Use Content Classifiers (3)
Word Meanings (3)                                                 Paragraph Meaning (4)
Special Usage (2)

Table 2. Rules for Determining Mastery.

Number of Items Testing an Objective    1    2    3    4         5         6
Items Required for Mastery              1    2    3    3 or 4    4 or 5    5 or 6
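
The rules in Table 2 are what a 75 percent criterion yields once it is rounded up to a whole number of items. A minimal Python sketch under that assumption follows; the function names are illustrative and do not come from the test publisher.

```python
def items_required(n_items: int) -> int:
    """Smallest whole number of correct items that is at least 75 percent."""
    # Integer form of ceil(0.75 * n_items), which avoids floating-point rounding.
    return (3 * n_items + 3) // 4


def is_mastered(n_correct: int, n_items: int) -> bool:
    """True if a score on one objective meets the 75 percent criterion."""
    return n_correct >= items_required(n_items)


# Minimum passing score for objectives measured by 1 to 6 items.
for n in range(1, 7):
    print(n, items_required(n))
```

For objectives measured by one, two, or three items, the criterion therefore requires every item to be correct.
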

There are standard ways of reporting and aggregating standardized
achievement test scores for individuals as well as for groups. In this
study, the gains on the standardized achievement test were obtained by
subtracting the pretest standard score from the posttest standard score
for each child. The standard score gains were averaged within levels
of the criterion referenced test groups.
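
The standard score gain described above can be sketched in a few lines of Python. The records are hypothetical and serve only to show the subtraction and the averaging within criterion referenced test levels.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (CRT level, pretest standard score, posttest standard score).
records = [
    ("I", 40, 43),
    ("I", 42, 47),
    ("II", 50, 52),
    ("II", 55, 61),
    ("III", 60, 70),
]


def mean_gain_by_level(rows):
    """Average (posttest - pretest) standard score gain within each CRT level."""
    gains = defaultdict(list)
    for level, pre, post in rows:
        gains[level].append(post - pre)
    return {level: mean(values) for level, values in gains.items()}


print(mean_gain_by_level(records))
```
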

However, there is no standard rule for presenting criterion referenced
test results. Several alternative methods of analyzing criterion referenced
test data are possible in terms of the group's status at the beginning
and the end of instruction or gains made during that period. The status
of pupils on either the pretest or posttest could be displayed as either
the percentage of items correct or the number of objectives mastered by
each child. Gain could be calculated either as an increase in the number
of items answered correctly or in terms of change in the number of
objectives mastered. Figure 1 describes the type of raw data that can be
aggregated to produce objective-based gain scores.

Figure 1. Pre and Post Changes on the Criterion Referenced Test.

                     Posttest
                   +         -
Pretest     +      A         C
            -      B         D

+ Objective mastered.
- Objective not mastered.

Cell A are those students who maintained mastery.
Cell B are those students who recently mastered.
Cell C are those students who lost mastery.
Cell D are those students who never mastered.

Cells A + B + C + D = T, the entire student population.

In this study, gain scores for individuals on the criterion referenced
test were calculated in two ways: first, as a simple raw item gain between the
pretests and posttests; and second, as net gain in objectives mastered (B - C)
by each student between the pretests and posttests. Each of these gains
was correlated with each other and with the gain in standard scores on the
standardized achievement test.
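
The cell assignment of Figure 1 and the per-student net gain (B - C) can be sketched as follows. The function names and the example student are hypothetical; only the cell definitions come from the text.

```python
from collections import Counter


def cell(pre_mastered: bool, post_mastered: bool) -> str:
    """Return the Figure 1 cell for one student on one objective."""
    if pre_mastered and post_mastered:
        return "A"  # maintained mastery
    if not pre_mastered and post_mastered:
        return "B"  # recently mastered
    if pre_mastered and not post_mastered:
        return "C"  # lost mastery
    return "D"  # never mastered


def net_objective_gain(pre, post) -> int:
    """Net gain in objectives mastered (B - C) for one student."""
    counts = Counter(cell(a, b) for a, b in zip(pre, post))
    return counts["B"] - counts["C"]


# Hypothetical student measured on five objectives (True = mastered).
pre = [True, True, False, False, False]
post = [True, False, True, True, False]
print(net_objective_gain(pre, post))
```

In this example two objectives are newly mastered and one is lost, so the net gain is one objective even though the student mastered two new ones.
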

Gain scores for groups of individuals were calculated in four ways:

1. Gross Mastery in Total: the percentage of students who achieved
   mastery on the posttest, (A + B)/T.

2. Gross Gain in Non-Masters: the number of students gaining mastery
   as a percentage of the non-mastery group on the pretest, B/(B + D).

3. Gain in Total: the number of students gaining mastery as a percentage
   of the total group, B/T.

4. Net Gain in Total: the number of students gaining mastery minus
   the number losing mastery as a percentage of the total group,
   (B - C)/T.
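
The four group measures can be computed directly from the cell counts of Figure 1. The sketch below uses hypothetical counts for a single objective and assumes that at least one student was a non-master on the pretest, so that B + D is not zero.

```python
def group_gain_measures(a, b, c, d):
    """Four group summaries, as percentages, from the Figure 1 cell counts."""
    total = a + b + c + d
    return {
        "gross_mastery_in_total": 100 * (a + b) / total,
        "gross_gain_in_non_masters": 100 * b / (b + d),
        "gain_in_total": 100 * b / total,
        "net_gain_in_total": 100 * (b - c) / total,
    }


# Hypothetical counts for one objective in a group of 100 students.
measures = group_gain_measures(a=20, b=30, c=10, d=40)
for name, value in measures.items():
    print(f"{name}: {value:.1f}%")
```

With these counts the same group shows about 43 percent gain among non-masters, 30 percent gain in total, and 20 percent net gain, which illustrates how far apart the measures can fall.
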

Obviously, the data could be reported for each objective or aggregated
over all of the objectives at each level. The emphasis of this report is
on using the data for program evaluation. Therefore, the data are aggregated
over individuals and objectives for presentation. The average values of
A, B, C, and D in terms of objectives for students in each group were
calculated. The data were also presented as percentage of mastery according
to the four data presentation methods defined above.

Results

The matrix of students-by-objectives on a criterion referenced test
reported on a mastery or non-mastery basis clearly has a diagnostic
instructional value. In an instructional situation in which all of the
students begin with non-mastery, the proportion of students gaining
mastery across the instructional period becomes a measure of the impact
of the program. In most group situations, however, the assumption of
non-mastery prior to instruction cannot be made. In relatively
homogeneous groupings of students, as might be obtained by using a screening
device, some students will achieve mastery on initial testing, while
other students will not. The proportion of non-masters (D and B) who
subsequently achieve mastery (B) provides a relatively optimistic estimate
of the effects of the program. In these cases, the number of individuals
gaining mastery is more revealing than the actual percentage of pupils
gaining mastery, since the percentage values can be inflated if most
of the students achieved mastery prior to the instructional sequence.
The percentage of the total group mastering an objective across the
instructional period provides a more balanced measure of program impact.
However, it may look artificially low for those objectives mastered by
high percentages of students on the pre-measure. All of these measures
of gain in mastery fail to describe the impact of instruction in a course
where significant loss in mastery (C) occurs in those objectives assessed
as mastered on the pre-measure. Table 3 displays the percentages of
mastery under the above conditions with objectives accumulated over all
students at the three levels.

Table 3. Percentage of Objectives by Group, by Alternative Methods.

Alternative Methods for                      Criterion Referenced Test Levels
Criterion Referenced Test Data Analysis      I              II             III

1) Gross Gain in Total, (A + B)/T            [illegible]    45%            37%
2) Gross Gain in Non-Masters, B/(D + B)      35             [illegible]    28
3) New Gain in Total, B/T                    27             20             20
4) Net Gain in Total, (B - C)/T              20             4              10

In all cases, a significant proportion of the total number of objectives
remained unmastered on the posttest. Overall, the students at each level
mastered almost one third of the objectives which were assessed as
non-mastered on the pretest (B/(D + B)). As might be expected, the level of
total mastery found a smaller proportion of the objectives mastered (B/T).
When interest is focused on the net gain in mastery ((B - C)/T), the
proportional impact of the program becomes less impressive.

Clearly, the method of reporting mastery has an effect on the
interpretation of these results. Methods which concentrate on the impact on
non-masters can clearly exaggerate the cumulative impact on retained
learning (B/(D + B) vs. (B - C)/T).

The same data can be expressed as an average value for all students
on the quantities A, B, C, D, and Total used to calculate the percentages
in Table 3.

Table 4. Changes in Average Criterion Referenced Test Objectives, by
Level.

         Number of    Total # of    Maintained    Recently    Lost       Never
Level    Students     Objectives    Mastery       Mastered    Mastery    Mastered
                                    A             B           C          D

I        70           14            2.3           3.8         .9         [illegible]
II       100          11            2.7           2.2         1.7        [illegible]
III      12           13            2.1           2.6         1.3        6.9

The large increases in the average number of objectives lost across
instruction (C) clearly affect the interpretation of the results as
measures of program impact.

The most commonly mentioned advantages of criterion referenced
tests are their diagnostic usefulness in targeting instruction to
specific homogeneous objectives and their sensitivity as measures of the
effects of the program on these targeted objectives. These advantages
were maximized in the curriculum being assessed in the present study.
Progress through the pre-programmed curriculum was based on successful
attainment of mastery on items and criteria levels which coincide with
the items and criteria levels utilized in both the pretest and posttest.

Table 5 gives the pre, post, and gain scores on the criterion referenced
test in both items and objectives at each level. These gains correspond
to the net gains (B - C) outlined in Table 4. By ordinary measurement
standards, the criterion referenced test item gains are, at best, modest.
If these gains represent a more sensitive assessment of the true program
impact, then the weight of evidence borne by individual items is very light.

Table 5. Pretest, Posttest, and Gain Scores by Criterion Referenced Test
Items and Objectives.

                   Pretest                Posttest               Gain
Level    N      Items   Objectives     Items   Objectives     Items   Objectives

I        70     26.6    3.2            30.8    6.1            4.2     2.9
II       100    24.8    4.4            26.0    4.9            1.3     .5
III      12     23.9    3.4            25.3    4.7            1.3     1.3

The final question addressed in the study is whether the results of
criterion referenced testing give different estimates of impact than would
have been obtained from standardized test results.

Table 6 describes the gain scores by criterion referenced test items,
criterion referenced test objectives, standardized test standard scores,
and grade equivalents.



Table 6. Mean Gain Scores for CRT and Standardized Test by Group.

                                          Standardized Test
Group    N      CRT Objective    Standard Score    Grade Equivalent
                Gain             Gain              Gain

I        70     2.9              3.0               .2 years
II       100    .5               5.0               .3 years
III      12     1.3              10.0              1.4 years



The comparison of net gains is quite different across the three levels.
The criterion referenced test results would suggest that the program was
most successful with the lower level students, second best with the highest
level, and worst with the middle level students. Whether the CRT gains are
positive or negative must be determined in relation to some standard that
presently is not available. The standardized test results indicate poor
gains in reading comprehension for the lowest group; predictable, but not
outstanding gains for the middle group; and quite exceptional gains for
the admittedly smaller highest group. On the surface, then, the criterion
referenced tests do not give the same estimates of program effectiveness
as would have been obtained from standardized tests.

The gain scores on the Criterion Referenced Tests by item and
objective were correlated with the gains in standard scores on the
Standardized Test within each group (Table 7).



Table 7. Intercorrelations Between Gain Scores by Group.

                                  Group I             Group II            Group III
Variable                          1     2     3       1     2     3       1     2     3

1. Gain in Standard Score         1.00                1.00                1.00
2. Gain in CRT Items              .00   1.00          .14   1.00          .15   1.00
3. Gain in CRT Objectives         .14   .84   1.00    .17   .72   1.00    .23   .42   1.00









The results suggest that the gains on the Standardized Test are
unrelated to the gains in either items or objectives on the Criterion
Referenced Test. The gains in items and objectives on the Criterion
Referenced Test were rather strongly correlated in two of the three
groups.
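
Intercorrelations of this kind can be sketched in Python. The text does not name the coefficient that was used, so the Pearson product-moment correlation is assumed here, and the gain scores are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gain scores for six students in one group.
standard_score_gain = [3, 5, 2, 8, 4, 6]
crt_item_gain = [4, 1, 3, 2, 5, 0]
crt_objective_gain = [2, 0, 1, 1, 3, 0]

print(round(pearson_r(crt_item_gain, crt_objective_gain), 2))
print(round(pearson_r(standard_score_gain, crt_item_gain), 2))
```
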

Discussion

Admittedly, the situation under which the present data were collected
deviates in many respects from an experimental study. No control was
possible on the amount of the commercial curricula covered by each
student. The students participating in the program proceeded at an
individualized pace through the material without regard to external
grade level standards. Further, the decision to use the standardized
reading comprehension scores for comparison was purely based on the
availability of data.

The resulting data are by all standards open to alternative explanations.
The program itself may not have been optimally implemented, or implemented
in similar fashions in the two sites. The focus of the study is on the
concurrent assessment of the impact of the program with two "types" of
instruments: criterion referenced test and standardized achievement
test.

It is clear that the manner in which criterion referenced test results
are aggregated to measure program impact can affect the relative
interpretation of the results. Concentration on posttest scores of non-masters
can have a two-fold pernicious effect on the use of criterion referenced test
results. First, this form of reporting tends to exaggerate the estimates
of program effectiveness. It takes advantage of a form of regression
effect to the extent that non-masters can only get better when assessed
by items with questionable "reliability" to assess objectives against a
relatively arbitrary criterion.
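
The regression argument can be illustrated with a toy simulation. This is not an analysis of the study data: every parameter below is a hypothetical assumption, chosen so that no student changes at all between pretest and posttest and the only source of movement between cells is the unreliability of a four-item measure.

```python
import random


def simulate(n_students=20000, n_items=4, p_correct=0.7, required=3, seed=0):
    """Pretest and posttest on one objective with NO true change between them.

    Each student answers each item correctly with the same probability on
    both occasions, so any apparent gain or loss is measurement error.
    """
    rng = random.Random(seed)

    def mastered():
        score = sum(rng.random() < p_correct for _ in range(n_items))
        return score >= required

    counts = {"A": 0, "B": 0, "C": 0, "D": 0}
    for _ in range(n_students):
        pre, post = mastered(), mastered()
        if pre and post:
            counts["A"] += 1  # maintained mastery
        elif post:
            counts["B"] += 1  # recently mastered
        elif pre:
            counts["C"] += 1  # lost mastery
        else:
            counts["D"] += 1  # never mastered
    return counts


cells = simulate()
total = sum(cells.values())
gross_gain_non_masters = cells["B"] / (cells["B"] + cells["D"])
net_gain_total = (cells["B"] - cells["C"]) / total
print(f"B/(B + D) = {gross_gain_non_masters:.2f}, (B - C)/T = {net_gain_total:.2f}")
```

Under these assumptions a majority of pretest non-masters appear to gain mastery, while the net gain stays near zero because chance losses offset chance gains.
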

Any valid measure of objective-based gain should include reassessment
of "mastered objectives" and calculation of a net gain in mastery. The
significant amount of "lost mastery" documented in the present paper
dictates reassessment of "presumed mastery."

Another effect of "presuming mastery" is more directed at the
diagnostic use of criterion referenced test results. A student who
achieves mastery of an objective on the basis of five items may be
eliminated from further instruction or reinforcement of the skill
involved. If the assessed mastery status was incorrectly made, then
the number of students who subsequently "lost mastery" includes a
significant number of misidentified students. It is also possible that
"mastery" in such a situation is dependent upon continual use of the
skill. The assumption that important learning is a one-time event does
not seem justified on the basis of existing learning theories.

The present results do not suggest that criterion referenced tests
give the same evaluation results as standardized tests. The gains on
the criterion referenced test were greatest at the lowest functional
level; the gains on the standardized test were greatest at the upper
functional level. One could hypothesize on this rather weak evidence
that criterion referenced tests are more sensitive to gain in lower
level skills, while standardized tests are more sensitive to higher
ones. The hypothesis deserves testing in other situations where
concurrent data on standardized tests and criterion referenced tests are
available. It seems likely that fundamental reading skills are more
consistent with a mastery learning model than are more complex behaviors.

The results show that the effects of an instructional program will
not always be equally assessed by criterion referenced tests and
standardized tests. It has not been conclusively proven that criterion
referenced tests will show gain where standardized tests do not. Those
practitioners who turn to criterion referenced tests as a guaranteed
measure of "more positive results" will be disappointed occasionally.
The contention that learning outcomes are "adequately measured" by
comparison of performance on some limited number of test items with
some essentially baseless criterion level seems at least as capricious
as the basis on which the same decisions are made with standardized
achievement tests.